The image of an entity can be defined as a structured and dynamic
representation which can be extracted from the opinions of a group
of users or population. Automatic extraction of such an image has 
certain importance in political science and sociology related studies, 
e.g., when an extended inquiry from large-scale data is required. We 
study the images of two politically significant entities of France. These 
images are constructed by analyzing the opinions collected from a 
well known social media called Twitter. Our goal is to build a system 
which can be used to automatically extract the image of entities over 
time. 

In this paper, we propose a novel evolutionary clustering method 
based on the parametric link among Multinomial mixture models. 

First, we propose the formulation of a generalized model that establishes
parametric links among the Multinomial distributions. Afterward, we follow
a model-based clustering approach to explore different parametric
sub-models and select the best model. For the experiments, we first use
synthetic temporal data. Next, we apply the
method to analyze the annotated social media data. Results show 
that the proposed method is better than the state-of-the-art based 
on the common evaluation metrics. Additionally, our method can 
provide interpretation about the temporal evolution of the clusters. 


1. Introduction. We define an image as a multi-faceted representation
that aggregates a set of opinions or general impressions regarding an
entity. By entity, we mean a politician, a celebrity, a company, a brand,
etc. In this research, we are particularly interested in using annotated
social media data to extract the image of two French politicians and
observe its evolution over time. We consider the annotated data from the
ImagiWeb project (Velcin et al., 2014), which were extracted before and
after the 2012 French presidential election. The annotation provides a
compact and meaningful representation for each tweet. Our goal is to
develop a temporal/evolutionary clustering technique that groups the
annotated opinions and then extracts the image of an entity over time from
the clustering results. Subsequently, we want to interpret the temporal
changes of the image created from each group of users.

In recent years, social media has come to play a significant role in many
aspects of our daily activity. On popular platforms such as Twitter or
Facebook, users often give their opinions about particular entities, e.g.,
persons (politicians, actors) or products consumed in daily life. A common
method to analyze such data is to use a clustering method that naturally
groups the users/opinions, and then to investigate each group
independently. An important property of these data is that they may change
over time due to changes of the attributes and the
appearance/disappearance of users. Moreover, users may change their
opinion about the targeted entity.

An ordinary clustering method is unlikely to adapt to such temporal
dynamics, as it does not consider relevant information such as history and
temporal effects. The notion of evolutionary clustering (Chakrabarti,
Kumar and Tomkins, 2006; Xu, Kliger and Hero III, 2014; Chi et al., 2009;
Xu et al., 2012) addresses such situations: the method clusters temporal
data by taking the historic information and the current data into account
together. Numerous methods address these issues and cluster temporal data.
They are based on different strategies, such as spectral clustering
(Chi et al., 2009; Xu, Kliger and Hero III, 2014) and probabilistic
generative models (Blei and Lafferty, 2006; Xu et al., 2012; Kim et al.,
2015). However, how to interpret the evolution of the clusters remains an
important open issue. Motivated by this issue, we propose a novel method
based on the Multinomial mixture model (Bishop et al., 2006) that
simultaneously clusters the temporal data and interprets the evolution of
the clusters through some prior belief.

Multinomial Mixture (MM) model-based clustering is a popular method for
clustering discrete data (Meila and Heckerman, 2001; Silvestre, Cardoso and
Figueiredo, 2014; Hasnat, Alata and Tremeau, 2015; Agresti, 2002). Most
recently, it has been exploited to perform evolutionary clustering
(Kim et al., 2015). In this research, we consider MM as the core model for
the data and propose an evolutionary clustering method by deriving an
appropriate link between the parameters of MM at different times.

Parametric links among probability distributions have been used in the
context of transfer learning (Biernacki, Beninel and Bretagnolle, 2002;
Jacques and Biernacki, 2010; Beninel et al., 2012), where the goal is to
adapt a clustering model from a source population to a target one. In the
context of continuous features, Biernacki, Beninel and Bretagnolle (2002)
proposed a parametric link between Normal distributions. Jacques and
Biernacki (2010) extended it to binary features using the Bernoulli
distribution. However, no such formulation exists for the Multinomial
distribution. Moreover, such parametric link-based methods have never been
considered in the context of evolutionary clustering. This research
addresses both of these issues.

This research proposes a novel evolutionary clustering method for
extracting the image of political entities. Our contributions are to:
(a) propose a formulation for a parametric link among Multinomial
distributions; (b) develop a novel evolutionary clustering method by
exploiting the link parameters; and (c) provide an interpretation of the
link parameters that explains cluster evolutions. First, we use synthetic
data to evaluate and compare the proposed method with the state-of-the-art
methods. Next, we apply it to analyze the temporal dynamics of social media
data obtained from the ImagiWeb project (Velcin et al., 2014). Results in
Sec. 4 show that the proposed method performs better than the
state-of-the-art methods on the common evaluation metrics.

In the rest of the paper, we present the data in Sec. 2, describe our
proposed method in Sec. 3, present the experimental results in Sec. 4,
provide an analysis of the political data in Sec. 5, and finally draw
conclusions in Sec. 6.


2. The ImagiWeb project and the political opinion dataset. We
collected data from the political opinion dataset of the ImagiWeb^
(IW-POD) project; see Velcin et al. (2014) for further details of data
collection, relevant statistics and representation. IW-POD consists of
manually annotated tweets, from May 2012 to January 2013, related to two
French politicians: François Hollande (FH) and Nicolas Sarkozy (NS). First,
these tweets are annotated with 11 different aspects: Attribute (Att),
Person (Per), Entity (Ent), Skills (Ski), Political line (Pol), Balance
(Bal), Injunction (Inj), Project (Pro), Ethic (Eth), Communication (Com)
and No aspect detected (N/A). Afterward, each aspect is annotated with one
of 6 opinion polarities: very negative (-2), negative (-1), no polarity
(0), Null, positive (+1) and very positive (+2). For example, the tweet
"Sarko is more rational" (orig: Sarko est plus rationnel) is annotated with
the aspect called Person and polarity +1. It is about NS and indicates that
the user provides a positive opinion with an emphasis on the personal
attribute. As another example, the tweet "Nicolas Sarkozy, the worst
president of the Fifth Republic" (orig: Nicolas Sarkozy, le plus mauvais
président de la Vème République) is annotated with the aspect called Skills
and polarity -1. It is a negative opinion about NS and indicates that the
user puts the emphasis on the skills of NS.


^ http://mediamining.univ-lyon2.fr/velcm/imagiweb/dataset.html 

In order to use these tweets for clustering, they are regrouped within
the specified time epoch. Moreover, similar polarities are merged, e.g.,
the two positives (+1 and +2) are merged into a single positive (+).
Therefore, each aspect consists of four polarities: positive (+), negative
(-), zero (0) and undefined/null (∅). As a consequence, each regrouped
tweet represents the opinion of a user about a particular politician as a
44 (11 × 4) dimensional vector of discrete data. In our experiment, we
group opinions from IW-POD into three time^ epochs: t1, t2 and t3; see
Table 1 for details of the temporal data. Moreover, since the true number
of clusters is unknown, we run clustering for different numbers of
clusters ranging from 3 to 9.
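
The encoding described above can be illustrated with a minimal Python
sketch. It is not the ImagiWeb preprocessing code: the ordering of aspects
and polarities inside the vector, the function names and the use of None
for the Null polarity are all assumptions made for illustration.

```python
from collections import Counter

# Aspect codes as listed in the text; their order in the vector is an assumption.
ASPECTS = ["Att", "Per", "Ent", "Ski", "Pol", "Bal", "Inj", "Pro", "Eth", "Com", "N/A"]
# The four merged polarities; "null" stands for the undefined polarity.
POLARITIES = ["+", "-", "0", "null"]


def merge_polarity(raw):
    """Map one of the six annotated polarities to one of the four merged ones."""
    if raw in (1, 2):
        return "+"
    if raw in (-1, -2):
        return "-"
    if raw == 0:
        return "0"
    return "null"  # raw is None, i.e. the annotation carries no polarity


def encode_opinions(annotations):
    """Turn a list of (aspect, raw_polarity) pairs into a 44-dim count vector."""
    counts = Counter((aspect, merge_polarity(raw)) for aspect, raw in annotations)
    return [counts[(a, p)] for a in ASPECTS for p in POLARITIES]


# Hypothetical user with three annotated tweets inside one time epoch.
vector = encode_opinions([("Per", 1), ("Ski", -1), ("Per", 2)])
```

Here the two positive annotations on Person collapse into a single cell
with count 2, which is the effect of merging +1 and +2.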


Table 1
Details of the IW-POD dataset, which is divided into three time periods. Each observation
consists of a 44-dimensional discrete-valued vector that encodes information about 11
different aspects, each having 4 polarities.

Time stamp | Time period   | Significance              | Num. opinions N. Sarkozy | Num. opinions F. Hollande
t1         | 03/12 - 06/12 | Before and After Election | 1018                     | 1168
t2         | 07/12 - 10/12 | After Election            | 1067                     | 1079
t3         | 11/12 - 01/13 | After Election            | 1079                     | 708


3. Parametric Link Based Evolutionary Clustering. We adopt
the parametric link approach (Biernacki, Beninel and Bretagnolle, 2002;
Jacques and Biernacki, 2010) for evolutionary clustering by treating the
samples at time epoch t as the source samples and the samples at time t+1
as the target samples. With this assumption, we incorporate a linear link
between Multinomials at different time epochs. The proposed clustering
method is presented in Algorithm 1.

3.1. Related work. Evolutionary Clustering (ECL), also called clustering
over time, aims to cluster data that dynamically evolve over time
(Chakrabarti, Kumar and Tomkins, 2006). Ordinary clustering methods are
not appropriate, as they partition the data samples based only on certain
properties of the data. In contrast, ECL methods additionally consider
temporal smoothness, so as to reflect the long-term trends of the data
while being robust to short-term variations (Chakrabarti, Kumar and
Tomkins, 2006; Xu, Kliger and Hero III, 2014; Chi et al., 2009). ECL should
maintain four properties (Chakrabarti, Kumar and Tomkins, 2006):
consistency, noise removal, smoothing and cluster correspondence. The
demand for and application of such clustering methods are increasing
rapidly due to the significant growth of dynamic data in numerous domains.
ECL has been successfully applied to analyze news (Xu et al., 2012), social
media (Kim et al., 2015), stock prices (Xu, Kliger and Hero III, 2014),
photo-tag pairs (Chakrabarti, Kumar and Tomkins, 2006), and documents
(Blei and Lafferty, 2006).

^The first round of the presidential election was held on 22/04/2012 and
the second round run-off was held on 06/05/2012. Therefore, the data
collected during this election period belong to time epoch t1.

Temporal/evolutionary data clustering has been addressed from several
viewpoints in the literature, which naturally raises several task-specific
notions about ECL. They can be distinguished as follows: (1) clustering,
(2) monitoring and (3) interpreting. In the following paragraphs, we review
the relevant literature based on this distinction.

Following the definition of Chakrabarti, Kumar and Tomkins (2006), an
ECL method clusters data by considering the historic information and the
current data. Based on this definition, we do not consider methods that
ignore the historic information. Besides, in order to limit our focus to
parametric methods, we do not consider methods based on non-parametric
Bayesian approaches (Xu et al., 2008; Dubey et al., 2013; Kharratzadeh,
Renard and Coates, 2015).

Numerous methods based on different techniques have been proposed
in the literature (Chakrabarti, Kumar and Tomkins, 2006; Xu, Kliger and
Hero III, 2014; Chi et al., 2009; Xu et al., 2012; Kim et al., 2015; Blei
and Lafferty, 2006). Chakrabarti, Kumar and Tomkins (2006) provided a
generic framework for this problem and proposed evolutionary versions of
the k-means and hierarchical agglomerative clustering methods. Their
framework is based on optimizing a global cost function that combines the
snapshot (static clustering) quality and the history cost (temporal
smoothness). This is considered the first work on evolutionary clustering
and has subsequently been extended by other researchers. Chi et al. (2009)
proposed two evolutionary clustering methods based on spectral clustering.
In their approach, they added terms to the clustering cost functions in
order to regularize the temporal smoothness. Xu, Kliger and Hero III (2014)
recently proposed AFFECT, which performs adaptive evolutionary clustering
by estimating an optimal smoothing parameter. This approach is extended
with several static clustering methods, such as k-means, hierarchical and
spectral clustering. A common property of these methods is that they are
specialized for continuous data and hence may not be an appropriate choice
for clustering categorical data, which is our concern in this research.

Dynamic Topic Model (DTM) is a well-known probabilistic method for
analyzing temporal categorical data (Blei and Lafferty, 2006). It was
originally developed to analyze the time evolution of topics in large
document collections. DTM extends the popular topic modeling method called
Latent Dirichlet Allocation (LDA) (Blei, Ng and Jordan, 2003). It uses
Dirichlet prior based smoothing, which sometimes over-smooths the data. As
a consequence, it may cluster data samples with non co-occurring features
in the same group (Kim et al., 2015). This eventually causes DTM to
underperform when clustering some classical non-textual temporal
categorical data. Recently, Kim et al. (2015) addressed this issue and
proposed a probabilistic generative model based evolutionary clustering
method, called Temporal Multinomial Mixture (TMM). TMM extends the
classical Multinomial Mixture (MM) model by incorporating temporal
dependency into the relation between the data components of the current
time epoch and the clusters of the previous time epoch. MM is a well-known
standard probabilistic model, which has been widely used to cluster static
discrete/categorical data (Meila and Heckerman, 2001; Silvestre, Cardoso
and Figueiredo, 2014). Similar to MM, TMM estimates model parameters using
an Expectation Maximization (EM) algorithm. Although both DTM and TMM
provide reasonable results when clustering temporal categorical data, they
are unable to detect and provide any interpretation of the cluster
evolutions, which is one of the main foci of this research. Of the two, TMM
is more closely related to our proposed approach, as we aim to establish a
parametric link among MMs at different time epochs.

The evolution monitoring task (Spiliopoulou et al., 2006; Oliveira and
Gama, 2010; Ferlez et al., 2008; Lamirel, 2012) tracks the evolution of
clusters by identifying the birth, death, split, merge and survival of
clusters at different times. An external clustering method is first used at
each time to cluster the data, e.g., Spiliopoulou et al. (2006) and
Oliveira and Gama (2010) used the k-means method, whereas Lamirel (2012)
used the neural clustering method. Afterward, the association and mapping
among the clusters at different times is examined based on several
heuristics. For example, Oliveira and Gama (2010) used cluster centroid
related statistics, called comprehensive representation of clusters. This
approach is very similar to the notion of detecting recurrent concept
drifts in a semi-supervised context, see Li, Wu and Hu (2012) for an
example. A different method, called the label-based diachronic approach
(Lamirel, 2012), exploits the MultiView Data Analysis paradigm among the
cluster labels at different times. In this approach, each feature is
analyzed individually to compute recall, precision and F-measure. This
information is used to construct heuristics for monitoring evolution. Our
approach differs from the above methods, because: (a) we do not aim to
propose a cluster monitoring method explicitly and (b) we do not use a
static clustering method. Besides the above methods, Ferlez et al. (2008)
proposed a joint clustering-monitoring method which uses the cross
association algorithm to cluster data and a bipartite graph to monitor
evolution. For data clustering, they group the distinct features (words)
in each cluster and hence features do not coexist in different clusters.
This differs from our approach, as we exploit all the features in order to
provide a feature-level interpretation of the evolution.

The task of evolution interpretation aims to explain the reason for the
evolution of clusters at different times. It can be accomplished by
explicitly analyzing the features. To this aim, Lamirel (2012) used the
F-measures from individual features of the matched clusters (of different
times) to construct a similarity report. In our work, this interpretation
can be obtained directly from the link parameters by applying a threshold
on their values. Therefore, our method differs from Lamirel (2012), as the
computation of the link parameters is an integral part of the clustering
task.

Based on the above distinctions from several viewpoints (clustering,
monitoring and interpretation), we find that our method is more similar to
the evolutionary clustering methods than to the evolution monitoring
methods. Therefore, we compare our method only with the relevant
state-of-the-art evolutionary clustering methods, such as Xu, Kliger and
Hero III (2014), Blei and Lafferty (2006) and Kim et al. (2015).

Now we focus on the literature related to our proposal. The idea of a
parametric link in a transfer learning context (Beninel et al., 2012) is
inherited from the concept of Generalized Discriminant Analysis (GDA)
(Biernacki, Beninel and Bretagnolle, 2002). GDA adapts the classification
rule from a source population to a target population through a linear link
map of their descriptive parameters. This differs from standard
discriminant rules, which assume a similarity between the source and target
populations. Biernacki, Beninel and Bretagnolle (2002) proposed several
models with associated estimated parameters for GDA within the context of
the multivariate Gaussian distribution. Later, Jacques and Biernacki (2010)
extended the work of Biernacki, Beninel and Bretagnolle (2002) to binary
data using the Bernoulli distribution (Bishop et al., 2006). We observe
that these approaches can be exploited for developing an evolutionary
clustering method by replacing the notion of source/target with different
time epochs t-1/t. Such a development requires the derivation of the
linear link for the Multinomial distribution. Afterward, the link
parameters naturally allow us to interpret the evolution of the clusters at
different times.

Categorical data/observations consist of responses from a certain
number of categories. Different types (nominal and ordinal) of categorical
data are observed in numerous fields (Agresti, 2002), such as social
science, biomedical science, genetics, education and marketing. Moreover,
data from different tasks, such as text retrieval and visual object
classification, are often converted to the categorical form. For example,
text data can be converted to this form by considering each unique word of
the vocabulary as an independent category/term and then representing each
sentence/paragraph/document as a discrete count vector (Zhong and Ghosh,
2005). The Multinomial distribution is a standard probability distribution
for modeling and analyzing discrete categorical data (Agresti, 2002).

The Multinomial Mixture (MM) is a statistical model based on the
Multinomial distribution. It has been used for cluster analysis with
discrete data (Meila and Heckerman, 2001; Agresti, 2002; Zhong and Ghosh,
2005; Silvestre, Cardoso and Figueiredo, 2014; Hasnat et al., 2015). Meila
and Heckerman (2001) studied several Model-Based Clustering (MBC) methods
with MM and experimentally compared them using different criteria such as
clustering accuracy, computation time and number of selected clusters.
Silvestre, Cardoso and Figueiredo (2014) proposed an MBC method for MM
which integrates both the model estimation and selection tasks within a
single EM algorithm. In their work, they extended the MBC strategy
previously proposed by Figueiredo and Jain (2002) and provided a
formulation to compute the Minimum Message Length (MML) criterion for model
selection. Most recently, Hasnat et al. (2015) proposed an MBC method which
performs simultaneous clustering and model selection using the MM. Their
strategy performs a similar task to Silvestre, Cardoso and Figueiredo
(2014) in a computationally efficient manner, following an approach
previously proposed for the Gaussian distribution (Garcia and Nielsen,
2010) and the Fisher distribution (Hasnat, Alata and Tremeau, 2015).
Moreover, similar to Meila and Heckerman (2001), they provided a comparison
among different model initialization and selection strategies. Following
all of the above approaches (Meila and Heckerman, 2001; Silvestre, Cardoso
and Figueiredo, 2014; Hasnat et al., 2015), in this research we exploit the
MBC framework to cluster discrete data with MM.

MBC (Fraley and Raftery, 2002; Melnykov and Maitra, 2010) is a
well-established method for cluster analysis and unsupervised learning. It
assumes a probabilistic model (e.g., mixture model) for the data, estimates
the model parameters by optimizing an objective function (e.g., model
likelihood) and produces probabilistic clustering. The Expectation
Maximization (EM) algorithm (McLachlan and Krishnan, 2008) is mostly used
in MBC to estimate the model parameters. EM consists of an Expectation step
(E-step) and a Maximization step (M-step), which are iteratively employed
to maximize the log-likelihood of the data.

Initialization of the EM algorithm has a significant impact on clustering
results (McLachlan and Krishnan, 2008; Baudry and Celeux, 2015). The EM
algorithm is sensitive to its initialization because, with different
initializations, it may converge to different values of the likelihood
function, some of which can be local maxima (i.e., sub-optimal results). To
overcome this, numerous initialization strategies have been proposed and
tested in the relevant literature (Biernacki, Celeux and Govaert, 2003;
Meila and Heckerman, 2001; Baudry and Celeux, 2015; Hasnat et al., 2015).
Following their recommendations, we use the small-EM method (Biernacki,
Celeux and Govaert, 2003; Biernacki et al., 2006; Baudry and Celeux, 2015;
Hasnat et al., 2015) to initialize the MM parameters.

MBC has been commonly exploited to identify the best model for the data
by fitting a set of models with different parameterizations and/or number
of components and then applying a statistical model selection criterion
(Fraley and Raftery, 2002; Biernacki, Celeux and Govaert, 2000; Figueiredo
and Jain, 2002; Melnykov and Maitra, 2010; Hasnat, Alata and Tremeau,
2015). In this paper, we apply this model fitting and selection strategy
for two purposes: (a) to identify the parametric sub-models (Section 3.4)
and (b) to automatically select the number of components (Section 3.7).

3.2. Statistical model for evolutionary data samples. Let $S^t$ be a set
of samples corresponding to time $t$ and $S^{t+1}$ be a set from the next
time $t+1$. We assume that while the cluster labels for $S^t$ are known to
us (estimated from $t-1$), the labels of $S^{t+1}$ are unknown.

Let $S^t$ be composed of $N^t$ pairs
$(x_1^t, z_1^t), \ldots, (x_{N^t}^t, z_{N^t}^t)$, where
$x_i^t = \{x_{i,1}^t, \ldots, x_{i,D}^t\}$ is the $D$ dimensional count
vector of order $V$, i.e., $\sum_{d=1}^{D} x_{i,d}^t = V$, and $z_i^t$ is
the associated class label such that $z_{i,k}^t = 1$ if the data belongs
to cluster $k$, with $k = 1, \ldots, K$, and $z_{i,k}^t = 0$ otherwise. We
assume that any sample $x_i^t$ of $S^t$ is an independent realization of
the random variable $X^t$ of distribution:

         $$X^t \mid z_k^t = 1 \;\sim\; \mathcal{M}(V, \mu_k^t), \qquad k = 1, \ldots, K$$

where $\mathcal{M}(V, \mu_k^t)$ is the $V$-order Multinomial distribution
with parameter $\mu_k^t = (\mu_{k,1}^t, \ldots, \mu_{k,D}^t)$, which is
formally defined as (Bishop et al., 2006):

(3.1)    $$\mathcal{M}(x_i^t \mid V, \mu_k^t) = \frac{V!}{\prod_{d=1}^{D} x_{i,d}^t!} \prod_{d=1}^{D} \left(\mu_{k,d}^t\right)^{x_{i,d}^t}$$

Here, $\mu_k^t$ is the parameter of the Multinomial distribution of the
$k$-th class, with $0 < \mu_{k,d} < 1$ and $\sum_{d=1}^{D} \mu_{k,d} = 1$.
Therefore, samples of the entire set $S^t$ can be modeled with a mixture of
$K$ Multinomials, also called Multinomial Mixture (MM) model, which has the
following form:

(3.2)    $$f(x_i \mid \Theta_K) = \sum_{k=1}^{K} \pi_k \, \mathcal{M}(x_i \mid V, \mu_k)$$

In Eq. (3.2), $\Theta_K = \{(\pi_1, \mu_1), \ldots, (\pi_K, \mu_K)\}$ is
the set of model parameters, $\pi_k$ is the mixing proportion with
$\sum_{k=1}^{K} \pi_k = 1$ and $\mathcal{M}(x_i \mid V, \mu_k)$ is the
density function (Eq. (3.1)). Besides, we assume that the class label
$z_i^t$ is an independent realization of a random vector $Z^t$, distributed
according to the 1-order Multinomial:

         $$Z^t \sim \mathcal{M}(1, \pi^t)$$

where $\pi^t = (\pi_1^t, \ldots, \pi_K^t)$ is the mixing proportion of the
model in Eq. (3.2).
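
As an illustration of Eqs. (3.1) and (3.2), the following Python sketch
evaluates the Multinomial log-probability and the mixture log-density for
one count vector. It uses only the standard library; the function names and
the log-sum-exp formulation are choices made here for numerical stability
and are not taken from the paper. Mixing proportions are assumed to be
strictly positive.

```python
import math


def log_multinomial_pmf(x, mu):
    """Log of Eq. (3.1): log multinomial coefficient + sum_d x_d * log(mu_d)."""
    order = sum(x)  # V, the order of the count vector
    log_coef = math.lgamma(order + 1) - sum(math.lgamma(c + 1) for c in x)
    log_prob = 0.0
    for count, p in zip(x, mu):
        if count > 0:
            if p == 0.0:
                return float("-inf")  # impossible observation under this cluster
            log_prob += count * math.log(p)
    return log_coef + log_prob


def log_mixture_density(x, pis, mus):
    """Log of Eq. (3.2), summed over clusters with the log-sum-exp trick."""
    terms = [math.log(pi) + log_multinomial_pmf(x, mu) for pi, mu in zip(pis, mus)]
    top = max(terms)
    if top == float("-inf"):
        return top
    return top + math.log(sum(math.exp(t - top) for t in terms))
```

For example, `log_mixture_density([1, 1], [0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]])`
evaluates the density of a two-feature count vector of order 2 under two
identical clusters.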

The assumption of MM is similar for the samples of $S^{t+1}$, with random
variable $X^{t+1}$ and parameter $\mu_k^{t+1}$. However, for $S^{t+1}$ the
labels of the $N^{t+1}$ pairs
$(x_1^{t+1}, z_1^{t+1}), \ldots, (x_{N^{t+1}}^{t+1}, z_{N^{t+1}}^{t+1})$
are unknown. In the context of evolutionary clustering, our goal is to
estimate the unknown labels $z_i^{t+1}$ for $i = 1, \ldots, N^{t+1}$ using
the information from $S^t$ and $S^{t+1}$ by establishing a link between
$\mu_k^t$ and $\mu_k^{t+1}$.


3.3. Parametric link/relationship among temporal data. For random
variables $Y^t$ and $Y^{t+1}$ distributed according to the Gaussian
distribution, a linear distributional link exists (under weak assumptions)
(Biernacki, Beninel and Bretagnolle, 2002), which has the form
$Y^{t+1} \sim D Y^t + b$, where $D$ and $b$ are the link parameters among
the samples of different time epochs. For binary data, the following
distributional linear link among Bernoulli parameters ($\alpha^{t+1}$ and
$\alpha^t$ with $0 < \alpha < 1$) is derived by Jacques and Biernacki
(2010):

(3.3)    $$\alpha^{t+1} = \Phi\left(\delta \, \Phi^{-1}(\alpha^t) + \lambda \gamma\right)$$

where $\delta \in \mathbb{R}^{+} \setminus \{0\}$, $\lambda \in \{-1, 1\}$
and $\gamma \in \mathbb{R}$ are the link parameters. $\Phi$ is the
cumulative Gaussian function of mean 0 and variance 1, see Fig. 3.1. We can
use the above formulation for Multinomial parameters by considering two
issues: (1) the Multinomial parameter $\mu_k^t$ has the same property as
$\alpha^t$, except that $\sum_{d=1}^{D} \mu_{k,d}^t = 1$, and (2) samples
from $X$ are not necessarily binary, which makes $\lambda$ useless.
Considering these issues, we can derive the parametric link between
$\mu_k^t$ and $\mu_k^{t+1}$ as:

(3.4)    $$\mu_{k,d}^{t+1} = \frac{\Phi\left(\delta_{k,d} \, \Phi^{-1}(\mu_{k,d}^t) + \gamma_{k,d}\right)}{\sum_{r=1}^{D} \Phi\left(\delta_{k,r} \, \Phi^{-1}(\mu_{k,r}^t) + \gamma_{k,r}\right)}$$

where $\delta_{k,d} \in \mathbb{R}^{+} \setminus \{0\}$ and
$\gamma_{k,d} \in \mathbb{R}$ are the link parameters. In Eq. (3.4), the
combination of parameters $\delta_{k,d}$ and $\gamma_{k,d}$ for
$\forall k, d$ is called a full model, which is over-parameterized and may
lead to ambiguity. Instead, we consider several sub-models with certain
constraints on the parameters, see the following section.
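
The link of Eq. (3.4) can be sketched in a few lines of Python for a single
cluster. This is an illustration under stated assumptions rather than the
authors' implementation (which, per the footnote in Section 3.5.2, is in
R): the function name is hypothetical, and every component of mu_t must lie
strictly inside (0, 1), since the inverse Gaussian CDF is undefined at 0
and 1.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # Gaussian with mean 0 and variance 1


def link_multinomial(mu_t, delta, gamma):
    """Eq. (3.4) for one cluster: map mu^t to mu^{t+1}.

    mu_t, delta and gamma are lists of length D (one entry per feature).
    """
    probit = [_STD_NORMAL.inv_cdf(m) for m in mu_t]           # Phi^{-1}(mu^t)
    raw = [_STD_NORMAL.cdf(dl * q + g)                         # numerator of Eq. (3.4)
           for dl, q, g in zip(delta, probit, gamma)]
    total = sum(raw)                                           # renormalize to sum to 1
    return [r / total for r in raw]


mu_t = [0.2, 0.3, 0.5]
unchanged = link_multinomial(mu_t, [1.0] * 3, [0.0] * 3)   # delta = 1, gamma = 0
shifted = link_multinomial(mu_t, [1.0] * 3, [1.0, 0.0, 0.0])
```

With delta = 1 and gamma = 0 the parameters are returned unchanged, while a
positive gamma on the first feature raises its share in the next epoch.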

3.4. Parametric sub-models. The idea of defining sub-models is frequent
in Model-Based Clustering (MBC) (Fraley and Raftery, 2002). We fit the
evolutionary clustering model (Eq. (3.4)) with different sub-models and then
select the best model using the Bayesian Information Criterion (Schwarz
et al., 1978):

(3.5)    $$\mathrm{BIC} = -2 L(\Theta) + \nu \log\left(N^{t+1}\right)$$

where $L(\Theta)$ is the log-likelihood (Eq. (3.6)) value associated to
the MM parameters of $t+1$ and $\nu$ is the number of free parameters of
the sub-model. These sub-models provide sufficient interpretation about the
change in parameters from time $t$ to $t+1$. The definition and
interpretation of several basic sub-models, each written as a pair
$\delta_{k,d}/\gamma_{k,d}$, are given below:

(M1) $1/0$: This model is constrained with $\delta_{k,d} = 1$ and
$\gamma_{k,d} = 0$ for $\forall k, d$, i.e., $\nu = 0$. It indicates that
the observations $X^{t+1}$ can be modeled with $\mu^t$ and hence no
evolution occurred.

(M2) $0/\gamma_{k,d}$: This model is constrained with $\delta_{k,d} = 0$
for $\forall k, d$, i.e., $\nu = K * D$. It indicates that the observations
$X^{t+1}$ should be modeled without considering $\mu^t$. This model should
be selected when a new cluster evolved independently and does not consider
any historical information. This is the most general model, which can
certainly fit the observations $X^{t+1}$ to a MM most efficiently, subject
to a good initialization of the alternating iterative method. Several
possible variations^ of this model are: $0/\gamma$, $0/\gamma_k$, and
$0/\gamma_d$.

(M3) $\delta_{k,d}/0$: This model is constrained with $\gamma_{k,d} = 0$
for $\forall k, d$, i.e., $\nu = K * D$. It indicates that $\mu^{t+1}$ are
evolved through $\mu^t$ in a specific transformation space (inverse
cumulative Gaussian). This model should be selected when true evolution
occurred which can be explained in detail through certain belief on
observed features and obtained clusters. Moreover, such a model can be
plugged in with any other method in order to describe the cluster
evolution. Several possible variations of this model are: $\delta/0$,
$\delta_k/0$ and $\delta_d/0$. This model is equivalent to the fundamental
unconstrained model assumed by Biernacki, Beninel and Bretagnolle (2002).

(M4) $1/\gamma_{k,d}$: In this model, $\delta_{k,d} = 1$ for
$\forall k, d$, i.e., $\nu = K * D$. This model performs nearly the same
task as model M3. It is relatively easier to fit through the additive term
in the inverse cumulative Gaussian space. On the other hand, it is less
expressive in terms of interpretation. Several possible variations of this
model are: $1/\gamma$, $1/\gamma_k$, and $1/\gamma_d$.

^Subscript k means cluster dependent and d means feature dependent. No
subscript means a constant value for all clusters and features.
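
The selection step of Eq. (3.5) can be sketched as follows. The text states
the number of free parameters only for the constrained model M1 (0) and for
the fully indexed forms (K * D); the counts used here for the scalar,
cluster-dependent and feature-dependent variations (1, K and D) are an
assumption that follows the footnote's reading of the subscripts. The
function names and the log-likelihood values in the usage lines are
hypothetical.

```python
import math


def n_free_parameters(delta_form, gamma_form, K, D):
    """Number of free link parameters nu for a sub-model delta_form/gamma_form.

    Each form is one of: "fixed" (constant 0 or 1), "scalar" (no subscript),
    "k" (cluster dependent), "d" (feature dependent) or "kd" (both).
    """
    sizes = {"fixed": 0, "scalar": 1, "k": K, "d": D, "kd": K * D}
    return sizes[delta_form] + sizes[gamma_form]


def bic(log_likelihood, nu, n_samples):
    """Eq. (3.5): BIC = -2 L(Theta) + nu * log(N^{t+1})."""
    return -2.0 * log_likelihood + nu * math.log(n_samples)


def select_submodel(candidates, n_samples):
    """candidates maps a model name to (log_likelihood, nu); lowest BIC wins."""
    return min(candidates, key=lambda name: bic(*candidates[name], n_samples))


# Made-up log-likelihoods for K = 3 clusters and D = 44 features.
candidates = {
    "1/0": (-1000.0, n_free_parameters("fixed", "fixed", 3, 44)),
    "delta_d/0": (-800.0, n_free_parameters("d", "fixed", 3, 44)),
    "delta_kd/0": (-790.0, n_free_parameters("kd", "fixed", 3, 44)),
}
best = select_submodel(candidates, n_samples=1000)
```

In this made-up case the feature-dependent variation wins: its gain in
log-likelihood outweighs the penalty for 44 parameters, whereas the fully
indexed model pays for 132 parameters without fitting much better.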

3.5. Parameter estimation. In our proposed formulation of evolutionary
clustering, we estimate two different types of parameters (see Eq. (3.4)):
(1) MM model parameters: $\mu$ and $\pi$ and (2) temporal link parameters:
$\delta$ and $\gamma$. We estimate them in two steps. The first step
consists of estimating $\mu$ and $\pi$ (only for $t = 1$) for the observed
samples of time $t$. In the second step, we estimate $\delta$ and $\gamma$.
At any time epoch, we estimate the class labels $z_i$ by maximum a
posteriori.


3.5.1. Multinomial Mixture (MM) Parameters. At time $t = 1$, we estimate
the MM parameters using an Expectation Maximization (EM) algorithm that
maximizes the log-likelihood value, which has the following form:

(3.6)    $$L(\Theta) = \sum_{i=1}^{N} \log \sum_{j=1}^{K} \pi_j \, \mathcal{M}(x_i \mid \mu_j)$$

where $N = N^t$ is the number of samples. In the Expectation step (E-step),
we compute the posterior probability as:

(3.7)    $$p_{i,k} = p(z_{i,k} = 1 \mid x_i) = \frac{\pi_k \prod_{d=1}^{D} \mu_{k,d}^{x_{i,d}}}{\sum_{l=1}^{K} \pi_l \prod_{d=1}^{D} \mu_{l,d}^{x_{i,d}}}$$

In the Maximization step (M-step), we update $\pi_k$ and $\mu_{k,d}$ as:

(3.8)    $$\pi_k = \frac{1}{N} \sum_{i=1}^{N} p_{i,k} \qquad \text{and} \qquad \mu_{k,d} = \frac{\sum_{i=1}^{N} p_{i,k} \, x_{i,d}}{\sum_{i=1}^{N} \sum_{r=1}^{D} p_{i,k} \, x_{i,r}}$$

The E and M steps are iteratively employed until a certain convergence
criterion (difference of the log-likelihood values of successive
iterations) is satisfied. The estimation of $\mu_{k,d}$ using Eq. (3.8) is
only applicable for $t = 1$ due to the unavailability of any temporal
information. For any time $t+1$, when the link parameters are available,
$\mu_{k,d}$ is estimated with Eq. (3.4).
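
The E-step, M-step and stopping rule above can be sketched in plain Python
for the first time epoch. This is a minimal illustration, not the authors'
code: the function names are hypothetical, the computation is done in log
space, and the small `smoothing` constant in the M-step is an added
safeguard (not in the text) that keeps every estimated proportion strictly
positive. The multinomial coefficient is omitted because it cancels in the
posterior and only shifts the log-likelihood by a constant.

```python
import math


def _log(v):
    """Natural log that maps 0 to -inf instead of raising."""
    return math.log(v) if v > 0 else float("-inf")


def _log_joint(x, pis, mus):
    """log(pi_k) + sum_d x_d * log(mu_kd) for every cluster k."""
    return [_log(pi) + sum(c * _log(m) for c, m in zip(x, mu) if c > 0)
            for pi, mu in zip(pis, mus)]


def _logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def e_step(X, pis, mus):
    """E-step: posteriors[i][k] = p(z_ik = 1 | x_i)."""
    posteriors = []
    for x in X:
        logs = _log_joint(x, pis, mus)
        norm = _logsumexp(logs)
        posteriors.append([math.exp(v - norm) for v in logs])
    return posteriors


def m_step(X, posteriors, smoothing=1e-6):
    """M-step: update mixing proportions and Multinomial parameters."""
    N, K, D = len(X), len(posteriors[0]), len(X[0])
    pis = [sum(p[k] for p in posteriors) / N for k in range(K)]
    mus = []
    for k in range(K):
        num = [sum(p[k] * x[d] for p, x in zip(posteriors, X)) + smoothing
               for d in range(D)]
        den = sum(num)
        mus.append([v / den for v in num])
    return pis, mus


def log_likelihood(X, pis, mus):
    """Log-likelihood up to the additive multinomial-coefficient constant."""
    return sum(_logsumexp(_log_joint(x, pis, mus)) for x in X)


def fit_mm(X, pis, mus, tol=1e-8, max_iter=200):
    """Iterate E and M steps until the log-likelihood change drops below tol."""
    previous = log_likelihood(X, pis, mus)
    for _ in range(max_iter):
        pis, mus = m_step(X, e_step(X, pis, mus))
        current = log_likelihood(X, pis, mus)
        if abs(current - previous) < tol:
            break
        previous = current
    return pis, mus


# Toy data: two well-separated groups of count vectors (D = 3, V = 10).
X = [[9, 1, 0], [8, 2, 0], [10, 0, 0], [0, 1, 9], [0, 2, 8], [1, 0, 9]]
pis, mus = fit_mm(X, [0.5, 0.5], [[0.6, 0.2, 0.2], [0.2, 0.2, 0.6]])
# Maximum a posteriori labels, as used for the class labels in Section 3.5.
labels = [max(range(len(p)), key=p.__getitem__) for p in e_step(X, pis, mus)]
```

The initial parameters are passed in explicitly here; in the paper they
come from the small-EM procedure of Section 3.6.1.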


3.5.2. Link parameters. Estimation of the link parameters $\delta_{k,d}$
and $\gamma_{k,d}$ uses $\mu^t$ and the observed samples at time $t+1$.
Similar to Jacques and Biernacki (2010), we again use an EM algorithm, but
one in which the M-step is not explicit. Consequently, we employ an
external optimization method, such as an alternating iterative algorithm
that consists of a componentwise succession of the simplex method^ (Nelder
and Mead, 1965). In general, the starting point of the alternating
algorithm corresponds to the case when $\mu^{t+1} = \mu^t$, i.e.,
$\delta_{k,d} = 1$ and $\gamma_{k,d} = 0$. However, in order to obtain a
better estimate and save computation time, we apply an efficient approach,
see Section 3.6.2.


Algorithm 1: Algorithm for clustering using parametric link among
multinomial mixtures (PLMM).

Input: $\mathcal{X} = \{S^t\}$, $S^t = \{x_i\}$, $x_i = \{x_{i,d}\}$, $x_{i,d} \in \mathbb{N}$
Output: Evolutionary clustering of $\mathcal{X}$ with $K$ classes and link
        parameters $\delta_{k,d}^{t}$ and $\gamma_{k,d}^{t}$, $\forall k, d, t$.

foreach t do
    if t = 1 then
        Initialize $\pi_k$ and $\mu_k$ for $1 \le k \le K$ using the
        small-EM procedure, see Section 3.6.1;
    end
    while not converged do
        {Perform the E-step of EM};
        foreach i and k do
            Compute $p_{i,k} = p(z_{i,k} = 1 \mid x_i)$ using Eq. (3.7)
        end
        {Perform the M-step of EM};
        for k = 1 to K do
            if t = 1 then
                Update $\pi_k$ and $\mu_k$ using Eq. (3.8)
            else
                Update $\pi_k$ using Eq. (3.8)
                Compute $\delta_{k,d}$ and $\gamma_{k,d}$, see Sec. 3.5.2
                Update $\mu_k$ using Eq. (3.4)
            end
        end
    end
end


3.6. Parameters initialization. In the proposed clustering method
(Algorithm 1), we need to initialize both the MM parameters
$\Theta^{init} = \{(\pi_1^{init}, \mu_1^{init}), \dots, (\pi_K^{init}, \mu_K^{init})\}$
for time t1 and the link parameters ($\delta$ and $\gamma$).

* For the implementation, we used the neldermead function of the nloptr R
package (Ypma, 2014). The lower and upper bounds were set to $-2.5$ and
$+2.5$ respectively, only for the $\gamma_{k,d}$ parameters.

† The simplex method requires a large number of iterations to converge.
3.6.1. Multinomial Mixture (MM) Parameters. Generally, the MM parameters
are initialized randomly (Meila and Heckerman, 2001; Hasnat et al., 2015).
However, Hasnat et al. (2015) demonstrated with both synthetic and real data
that random initialization has limitations w.r.t. clustering performance
and stability. Therefore, following Hasnat et al. (2015), we initialize the
model parameters using the small-EM procedure. This small-EM procedure
consists of running multiple short runs of randomly initialized EM and then
selecting the one with the maximum likelihood value. Here, short run means
that the EM procedure does not wait until convergence and is stopped when a
certain number of iterations is completed.
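The small-EM selection logic can be sketched generically as below. The
callables `init_fn` and `em_step_fn`, the default numbers of starts and
short iterations, and the toy stand-in used for illustration are all
hypothetical; the source does not specify these values.

```python
import random


def small_em(init_fn, em_step_fn, n_starts=10, n_short_iters=5, seed=0):
    """Run several short EM runs from random starts and keep the best one.

    init_fn(rng) -> params                  (hypothetical random initializer)
    em_step_fn(params) -> (params, loglik)  (hypothetical single EM iteration,
                                             loglik evaluated at the new params)
    """
    rng = random.Random(seed)
    best_params, best_loglik = None, float("-inf")
    for _ in range(n_starts):
        params = init_fn(rng)
        loglik = float("-inf")
        for _ in range(n_short_iters):  # short run: no convergence check
            params, loglik = em_step_fn(params)
        if loglik > best_loglik:
            best_params, best_loglik = params, loglik
    return best_params, best_loglik


# Toy stand-in for EM: each step moves the parameter halfway toward 3.0 and
# the "log-likelihood" is the negative squared distance to 3.0.
def toy_init(rng):
    return rng.uniform(-10.0, 10.0)


def toy_step(theta):
    theta = theta + 0.5 * (3.0 - theta)
    return theta, -(theta - 3.0) ** 2


best_theta, best_ll = small_em(toy_init, toy_step)
```

The selected parameters would then serve as the starting point of the full
EM run at $t = 1$.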

3.6.2. Link parameters. We propose an initialization procedure based on
the predictive parameters set for the next time epoch,
$\Theta^{pred} = \{(\pi_1^{pred}, \mu_1^{pred}), \dots, (\pi_K^{pred}, \mu_K^{pred})\}$.
Let $\Theta^{t} = \{(\pi_1^{t}, \mu_1^{t}), \dots, (\pi_K^{t}, \mu_K^{t})\}$
be the set of parameters for the current time epoch $t$. Our initialization
procedure consists of the following steps:

• Step 1: estimate $\Theta^{pred}$ using the data samples of the next time
epoch $t+1$ and an EM algorithm which is initialized with $\Theta^{t}$.

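• Step 2: compute $\delta_{k,d}^{init}$ and $\gamma_{k,d}^{init}$ for each
$k$ and $d$ as:

(3.9)   $\gamma_{k,d}^{init} = \Phi^{-1}(\mu_{k,d}^{pred})$   for model M2

(3.10)  $\delta_{k,d}^{init} = \Phi^{-1}(\mu_{k,d}^{pred}) \, / \, \Phi^{-1}(\mu_{k,d}^{t})$   for model M3

(3.11)  $\gamma_{k,d}^{init} = \Phi^{-1}(\mu_{k,d}^{pred}) - \Phi^{-1}(\mu_{k,d}^{t})$   for model M4

Eq. (3.9), (3.10) and (3.11) are simply derived from Eq. (3.4) with the
consideration that the denominator is equal to 1, i.e.,
$\sum_{d=1}^{D} \mu_{k,d} = 1$ for $k = 1, \dots, K$.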
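As a rough sketch of Step 2, the initial link values for one cluster can be
computed with the standard Normal quantile function. The sketch assumes that
the link of Eq. (3.4) acts as $\Phi(\delta \, \Phi^{-1}(\mu^{t}) + \gamma)$
before normalization, that M2, M3 and M4 correspond to $(0/\gamma)$,
$(\delta/0)$ and $(1/\gamma)$, and that all probabilities lie strictly
between 0 and 1. The function name and the fallback for $\mu^{t} = 0.5$ are
illustrative choices, not taken from the source.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard Normal: cdf is Phi, inv_cdf is Phi^{-1}


def init_link_params(mu_t, mu_pred, model):
    """Initial (delta, gamma) per feature for one cluster.

    mu_t, mu_pred: probabilities at time t and predicted for t+1,
    each strictly inside (0, 1), otherwise inv_cdf raises an error.
    """
    delta, gamma = [], []
    for m_t, m_p in zip(mu_t, mu_pred):
        q_t, q_p = _STD.inv_cdf(m_t), _STD.inv_cdf(m_p)
        if model == "M2":    # delta fixed to 0, gamma free
            delta.append(0.0)
            gamma.append(q_p)
        elif model == "M3":  # gamma fixed to 0, delta free
            # The ratio is undefined when mu_t == 0.5; fall back to 1.0
            # (an assumption made for this sketch).
            delta.append(q_p / q_t if q_t != 0.0 else 1.0)
            gamma.append(0.0)
        elif model == "M4":  # delta fixed to 1, gamma free
            delta.append(1.0)
            gamma.append(q_p - q_t)
        else:
            raise ValueError("model must be 'M2', 'M3' or 'M4'")
    return delta, gamma
```

Under the unit-denominator assumption, plugging the returned values back
into the link reproduces `mu_pred` for each feature.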


3.7. Varying number of clusters. The methodology presented in the previous
sub-sections assumes the same number of clusters $K$ for each time epoch.
In this sub-section, we propose an extension such that the method can
handle a varying $K$ at different times, i.e., $K_t$ and $K_{t+1}$ may be
different. To this aim, we modify the links initialization strategy (Section
3.6.2) in order to accommodate the variability between $\Theta^{t}$ and
$\Theta^{pred}$. At time epoch $t$, this extended method requires additional
information: (a) the number of clusters $K_{t+1}$ and (b) the cluster
mapping between $\Theta^{t}$ and $\Theta^{pred}$.

We adopted the method proposed by Hasnat et al. (2015) with the L-method
(Salvador and Chan, 2004) to select the number of clusters automatically at
each time epoch. In order to initialize the link parameters, first we select
the number of clusters $K_{t+1}$ and obtain the predictive parameter set
$\Theta^{pred}$. Next, for each cluster $k$ in $\Theta^{pred}$ we find the
corresponding cluster in $\Theta^{t}$ based on the minimum symmetric
Kullback-Leibler divergence (sKLD). The sKLD between two clusters $a$ and
$b$ is defined as (Hasnat et al., 2015):

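(3.12)  $sKLD(a, b) = \frac{D_{KL}(\mu_a, \mu_b) + D_{KL}(\mu_b, \mu_a)}{2}$,  with  $D_{KL}(\mu_a, \mu_b) = \sum_{d=1}^{D} \mu_{a,d} \log \frac{\mu_{a,d}}{\mu_{b,d}}$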

After establishing the correspondences, we use Eq. (3.9), (3.10) and (3.11) 
to set the initial values of the link parameters. Finally, we estimate the link 
parameters following Section 3.5.2. 
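The correspondence step can be illustrated with a short sketch. It assumes
the standard Kullback-Leibler divergence with natural logarithms and
strictly positive probabilities; `match_clusters` and the toy parameter
values are hypothetical names and numbers chosen for the example.

```python
import math


def kl(p, q):
    # D_KL(p, q) = sum_d p_d * log(p_d / q_d); terms with p_d == 0 vanish
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def skld(p, q):
    # symmetric KL divergence: average of the two directed divergences
    return 0.5 * (kl(p, q) + kl(q, p))


def match_clusters(mu_pred, mu_t):
    """For each predicted cluster, index of the closest cluster at time t."""
    return [min(range(len(mu_t)), key=lambda j: skld(p, mu_t[j]))
            for p in mu_pred]


# Toy example: K_t = 2 clusters at time t, K_{t+1} = 3 predicted clusters.
mu_t = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
mu_pred = [[0.15, 0.2, 0.65], [0.6, 0.3, 0.1], [0.2, 0.1, 0.7]]
mapping = match_clusters(mu_pred, mu_t)
```

Because several predicted clusters may map to the same cluster at time $t$,
this kind of mapping accommodates $K_{t+1} \neq K_t$.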

3.8. Interpretation of cluster evolution. The link parameters
($\delta_{k,d}$ and $\gamma_{k,d}$) along with the function $\Phi$ are the
key to interpreting the cluster evolution. Let us note some basic
interpretations of the values of these parameters for all features $d$ and
clusters $k$:

• $\delta_{k,d} = 0$ means that $\mu_{k,d}$ (probability) at $t + 1$ does
not depend on $t$, whereas $\delta_{k,d} = 1$ (with $\gamma_{k,d} = 0$)
means identical probability at two different times.

• $\delta_{k,d} \to 0$ and/or $\gamma_{k,d} \to \infty$ means that the
distribution tends to the uniform distribution.

• $\delta_{k,d} \to \infty$ and/or $\gamma_{k,d} \to -\infty$ means that the
distribution tends to be more concentrated (Dirac distribution) at time
$t + 1$ in the feature which has the highest probability at time $t$.

In order to get further interpretation, we need to understand the
Multinomial parameters $\mu_{k,d}$ and the space spanned by the cumulative
Gaussian $\Phi$ and its inverse $\Phi^{-1}$. Let us consider an experiment
of drawing $V$ balls of $d = 1, \dots, D$ different colors (which represent
features). After each draw, the color of the ball is recorded in a
$D$-dimensional count vector $x_i$ and the ball is replaced. Therefore, at
the end of the experiment, $x_{i,d}$ reveals the count of draws of the
$d$-th colored ball. When a Multinomial distribution is used to fit such
experimental data, its parameter $\mu_{k,d}$ reveals the probability of
drawing the $d$-th colored ball.

Fig 3.1. Illustrations of the Cumulative Gaussian function $\Phi(0,1)$ and
its relationship with the parameter change of the Multinomial distribution
using Eq. (3.4). The arrows indicate the direction of changes in the inverse
function space which eventually increase/decrease the probability.

Now, let us consider $\Phi$ in Fig. 3.1, where the values along the Y-axis
represent the possible values of $\mu_{k,d}$ (with $0 < \mu_{k,d} < 1$) and
the X-axis represents the values of $\mu_{k,d}$ after transforming through
the $\Phi^{-1}$ function. Now, according to Eq. (3.4), the cluster evolution
$\mu_{k,d}^{t} \to \mu_{k,d}^{t+1}$ can be explained through multiplication
(using $\delta_{k,d}$) and addition/subtraction (using $\gamma_{k,d}$)
operations.

The values of $\gamma_{k,d}$ directly indicate the increase/decrease of the
probability of a certain feature (color), subject to the selection of
sub-model M4. On the other hand, if sub-model M3 is selected, the values of
$\delta_{k,d}$ can explain the belief that $\mu_{k,d}^{t+1}$ should decrease
if $\mu_{k,d}^{t} < 0.5$ and increase if $\mu_{k,d}^{t} > 0.5$. For example,
let us consider that in a 2-color (red and green) ball experiment the
probability of the red ball changes from 0.8 (at time t1) to 0.7 (at time
t2). Such a change can be explained with model M3 with
$\delta_{k,red} = 0.6$, which indicates that the belief is decreased at the
next time. From the above discussion it is evident that the proposed method
is capable of interpreting the cluster evolutions up to the feature level.
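The two-color example can be checked numerically with a small sketch. It
assumes that Eq. (3.4) maps each probability through
$\Phi(\delta \, \Phi^{-1}(\mu^{t}) + \gamma)$ and then normalizes over the
features, and that the same $\delta = 0.6$ is applied to both colors (the
text only states the value for red). The name `apply_link` is illustrative.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard Normal: cdf is Phi, inv_cdf is Phi^{-1}


def apply_link(mu_t, delta, gamma):
    """Evolve one cluster's probabilities from time t to t+1."""
    raw = [_STD.cdf(d * _STD.inv_cdf(m) + g)
           for m, d, g in zip(mu_t, delta, gamma)]
    total = sum(raw)  # normalization so that the result sums to one
    return [r / total for r in raw]


# Sub-model M3 (gamma = 0) on the red/green example: red starts at 0.8.
mu_next = apply_link([0.8, 0.2], delta=[0.6, 0.6], gamma=[0.0, 0.0])
print([round(v, 2) for v in mu_next])  # red drops to roughly 0.69
```

With $\delta = 1$ and $\gamma = 0$ the same function returns the input
unchanged, which matches the first interpretation rule above.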

4. Numerical experiments. We begin the experiments using simulated
evolutionary data samples and evaluate w.r.t. the state-of-the-art
methods. A characteristic comparison of different methods is presented in
Table 2. For the simulated samples, we use the Adjusted Rand Index (ARI)
(Hubert and Arabie, 1985) as the evaluation measure. Next, we experiment
with and compare methods using real data. We use one of the real datasets
used by Kim et al. (2015). We choose the political opinion dataset from the
ImagiWeb project (Velcin et al., 2014) as it consists of data from an
interesting time period: during and after the election.
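For reference, the ARI between a ground-truth labeling and a clustering can
be computed from the contingency counts, as in the sketch below. This is the
standard Hubert and Arabie formulation written for illustration; it is not
taken from the authors' code, and the handling of degenerate cases is an
assumption.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index of two labelings of the same samples."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # degenerate case, treated as perfect agreement here
    # number of sample pairs that share a cell / a row / a column
    sum_cells = sum(comb(c, 2)
                    for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # e.g. both labelings put everything in one cluster
    return (sum_cells - expected) / (max_index - expected)
```

The index is invariant to a renaming of the cluster labels, which is why it
suits clustering evaluation.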

Table 2

Characteristic comparison of different state-of-the-art evolutionary
clustering methods: Parametric Link among Multinomial Mixtures (PLMM, our
proposed method), Temporal Multinomial Mixture (TMM) (Kim et al., 2015),
Dynamic Topic Model (DTM) (Blei and Lafferty, 2006) and adaptive
evolutionary clustering method (AFFECT) (Xu, Kliger and Hero III, 2014).

                      PLMM       DTM        TMM        AFFECT
Data Type             Discrete   Discrete   Discrete   Continuous
Interpret Evolution   Yes        No         No         No


4.1. Simulated Data Samples. Following standard sampling methods, we
generate different sets of simulated data for different time epochs.

We draw a finite set of categorical samples (discrete count vectors)
$S^t = \{x_i\}_{i=1,\dots,N^t}$ with different numbers (10, 20 and 40) of
features (dimensions) $D$. These samples are issued from Multinomial
Mixture (MM) models of $K = 3$ classes. We consider two different sets of
samples:

• Samples with higher order of categorical count (hos) with
$V \approx 1.5 \times D$, with 3 time epochs each having a different number
of i.i.d. samples: $N^1 = 500$, $N^2 = 100$, and $N^3 = 200$. We also add
noisy counts to these samples. This type of sample resembles the MM
parameters more closely because the observations contain a sufficient
number of counts. Practically, this corresponds to the case where the
observations consist of data collected over a longer period of time.

• Samples with lower order of categorical count (los) with
$V \approx 0.7 \times D$, with 5 time epochs each having a different number
of i.i.d. samples: $N^1 = 50$, $N^2 = 40$, $N^3 = 40$, $N^4 = 30$ and
$N^5 = 20$. This type of sample is sparse and the clusters are often
difficult to distinguish. Practically, this corresponds to the case where
the observations consist of data collected over a shorter period of time.

The evolutionary data generation process consists of two steps: (1)
determine the MM parameters $\mu_{k,d}^{t}$ at each time epoch
$t = 1, \dots, T$ and (2) sample observations from the specified MM
following the assumption specified by Blei, Ng and Jordan (2003). For
$t = 1$, we sample $\mu_{k,d}$ from a Dirichlet distribution and verify its
separation w.r.t. the other clusters' parameters (Silvestre, Cardoso and
Figueiredo, 2014) using the symmetric Kullback-Leibler Divergence value.
For $t > 1$, we obtain $\mu_{k,d}^{t}$ from $\mu_{k,d}^{t-1}$ using the MM
link relationship defined in Eq. (3.4). This ensures that we maintain the
temporal smoothness property (Chakrabarti, Kumar and Tomkins, 2006; Xu,
Kliger and Hero III, 2014) of the evolutionary data samples. In order to
use the link relationship, we use only model M4 for the hos data samples
and randomly select a model among M1, M3 and M4 for the los data samples.
Next, we set the associated link parameters ($\delta_{k,d}$ and
$\gamma_{k,d}$) randomly within a pre-specified range of values.

To sample observations, first we choose the order $V_k$ of each cluster. Our
sampling procedure for each observation $i$ at each time $t$ follows the
steps below:

• Choose a cluster $z_{i,k} = 1$ as: $z_i \sim \mathcal{M}(1, \pi_1, \dots, \pi_K)$,
with $\pi_1 = \dots = \pi_K$.

• Choose the order $V_i$ of the Multinomial for the sample $x_i$ using a
Poisson distribution as: $V_i \sim \mathrm{Poisson}(V_k)$.

• Draw the sample $x_i$ using the Multinomial distribution as:
$x_i \sim \mathcal{M}(V_i, \mu_{k,1}, \dots, \mu_{k,D})$.
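The three sampling steps above can be sketched as follows. The function
names and the use of Knuth's multiplication method for the Poisson draw are
choices made for this illustration (the Python standard library has no
Poisson sampler); the source does not describe how the draws were
implemented.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_observation(pi, mu, order, rng):
    """Draw one count vector from a Multinomial mixture.

    pi: mixing proportions, mu: K x D Multinomial parameters,
    order: expected order (total count) of each cluster.
    """
    k = rng.choices(range(len(pi)), weights=pi)[0]  # step 1: choose a cluster
    v = sample_poisson(order[k], rng)               # step 2: choose the order
    counts = [0] * len(mu[k])
    for d in rng.choices(range(len(mu[k])), weights=mu[k], k=v):
        counts[d] += 1                              # step 3: Multinomial draw
    return k, counts
```

Repeating `sample_observation` $N^t$ times with the parameters of epoch $t$
yields one simulated sample set $S^t$.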


Table 3

Simulated data evaluation and comparison using Adjusted Rand Index (ARI)
(Hubert and Arabie, 1985). Methods: PLMM (proposed), Dynamic Topic Model
(DTM), Temporal Multinomial Mixture (TMM) and AFFECT with k-means. Datasets
consist of different types (hos and los) of samples with different numbers
(10, 20 and 40) of features. hos: higher order samples and los: lower order
samples. Boldfaced numbers indicate the best result and underlined numbers
indicate the second best. Values inside the parentheses provide the
standard deviation of the ARI values.

            PLMM          TMM           DTM           AFFECT
10, hos     0.91 (0.07)   0.86 (0.11)   0.79 (0.14)   0.43 (0.12)
10, los     0.81 (0.19)   0.91 (0.1)    0.81 (0.1)    0.34 (0.11)
20, hos     0.96 (0.05)   0.91 (0.1)    0.81 (0.18)   0.37 (0.11)
20, los     0.90 (0.18)   0.98 (0.04)   0.95 (0.11)   0.35 (0.09)
40, hos     0.97 (0.05)   0.92 (0.11)   0.48 (0.4)    0.33 (0.11)
40, los     0.93 (0.16)   0.97 (0.05)   0.97 (0.1)    0.36 (0.1)


We applied our proposed Parametric Link among Multinomial Mixtures
(PLMM, Algorithm 1) clustering method to these simulated data using
the basic sub-models defined in Sec. 3.4. Table 3 provides the results using
the ARI (Hubert and Arabie, 1985) measure. Moreover, it provides a
comparative evaluation w.r.t. other state-of-the-art methods (see the
comparison in Table 2): (a) Temporal Multinomial Mixture (TMM) (Kim et al.,
2015) with smoothness parameter $\alpha = 1$; (b) Dynamic Topic Model (DTM)
(Blei and Lafferty, 2006) with hyper-parameter $\alpha = 0.01$ and (c)
Adaptive evolutionary clustering method (AFFECT‡) (Xu, Kliger and Hero III,
2014) with k-means and Euclidean distance as a measure of similarity. We
compute the average ARI of time $t = 2, \dots, T$ (at $t = 1$ there is no
evolution). Results in Table 3 w.r.t. ARI evaluation show that:

• PLMM (proposed) provides the highest ARI for the hos samples and
TMM (Kim et al., 2015) provides the highest ARI for the los samples.
These results are not surprising as both PLMM and TMM are specialized
methods to cluster samples which are drawn from Multinomial distributions.

• DTM (Blei and Lafferty, 2006) provides better results for los samples
and higher dimensional data. This type of data is more likely to be
extracted from text documents, for which DTM was originally proposed.

• AFFECT (Xu, Kliger and Hero III, 2014) performs poorly compared to the
others for both types of sample. This is expected because the similarity
measure used in AFFECT is appropriate for continuous data.

Next, we test statistical hypotheses among PLMM, TMM and DTM using the
two-sample t-test at the 5% significance level. The null hypothesis is that
the data in two results come from independent random samples from normal
distributions with equal means and equal but unknown variances. Results
show that for all hos data the hypothesis is rejected with p-value < 0.001.
On the other hand, for the los data it is rejected only for the
10-dimensional samples among the pairs (PLMM, TMM) and (DTM, TMM) with
p-value < 0.0001.

Next, we analyze the evolution of the clusters in terms of selected
sub-models. Table 4 provides the rate of different selected models. We see
that, for the hos data samples, the model M4 ($1/\gamma_{k,d}$) is mostly
selected. On the other hand, for the los data samples, different models M1:
(1/0), M4: ($1/\gamma_{k,d}$) and M3: ($\delta_{k,d}/0$) are selected at
certain rates. This observation confirms that PLMM successfully recovers
the cluster evolutions with the different models which were used to
generate the simulated data. Interestingly, we observe that the model M2
($0/\gamma_{k,d}$) is not selected, which reflects the fact that it was not
used to generate the simulated data samples. Now, based on the selected
model, we can provide further interpretation using $\delta_{k,d}$ and
$\gamma_{k,d}$, see Sec. 3.4.

Finally, we conduct experiments with a varying number of clusters $K$ at
different time epochs. For this experiment, we use the same MM parameters
which were used to generate the hos data samples. To ensure a different $K$
at different epochs, we randomly select a pair of time epochs and remove a
cluster from one of them. Then, we generate 1000 synthetic data samples
from them using the same procedure mentioned before. Applying the extension
of the PLMM method (Section 3.7) to these data provides the following
results (ARI): 0.967 (0.09) for $D = 10$, 0.988 (0.04) for $D = 20$ and
0.986 (0.05) for $D = 40$. These results confirm that our proposed
extension can cluster the synthetic data with varying $K$ with reasonable
accuracy.

‡ We also experimented with AFFECT using hierarchical and spectral
clustering. However, k-means provided the best results.

Table 4

Percentage of the selected models for the interpretation of evolution. hos:
higher order (categorical count) samples and los: lower order samples.
Boldfaced numbers indicate the highest rate.

            M1: (1/0)   M4: ($1/\gamma_{k,d}$)   M3: ($\delta_{k,d}/0$)   M2: ($0/\gamma_{k,d}$)
10, hos     0 %         94 %                     6 %                      0 %
10, los     15 %        38 %                     47 %                     0 %
20, hos     0 %         92 %                     8 %                      0 %
20, los     14 %        43 %                     43 %                     0 %
40, hos     0 %         96 %                     4 %                      0 %
40, los     4 %         37 %                     59 %                     0 %

4.2. IW-POD dataset. We consider three different methods, Dynamic
Topic Model (DTM) (Blei and Lafferty, 2006), Temporal Multinomial Mixture
(TMM) (Kim et al., 2015) and Parametric Link among Multinomial Mixtures
(PLMM), for a comparative evaluation of the performance on the IW-POD
dataset. These methods are selected because they are specialized to cluster
discrete evolutionary/temporal data. We set the maximum number of
iterations to 100 as a stopping criterion for all methods. In addition, we
set the log-likelihood difference threshold to 0.0001 for PLMM and TMM. The
smoothness parameter $\alpha$ of TMM was set to 1. The DTM hyper-parameter
$\alpha$ was set to 0.01. For the PLMM method, we consider the sub-models
mentioned in Sec. 3.4.

The IW-POD dataset does not provide ground truth cluster labels, so we were
unable to evaluate the clustering results with a known-labels based metric
such as ARI. In this context, we evaluate the methods using a well known
likelihood-related measure called perplexity on a held-out test set
(Murphy, 2012; Blei, Ng and Jordan, 2003). Perplexity is a quantity
originally used in the field of language modeling (Murphy, 2012). It
measures how well a model has captured the underlying distribution of
language. In the clustering context, perplexity is defined as the
reciprocal geometric mean of the per-feature (word) likelihood of a test
set, which is computed using the model parameters learned with a training
set. Therefore, a lower perplexity value indicates that the estimated
(trained) model fits the test data better. Perplexity can be formally
defined as (Blei, Ng and Jordan, 2003):

(4.1)  $\mathrm{perplexity}(S^{test}) = \exp\left( - \frac{L(\Theta^{train})}{\sum_{i=1}^{N} V_i} \right)$

where $V_i$ is the total number of feature counts (words for a document) in
observation $i$, and $L(\Theta^{train})$ denotes the log-likelihood of the
test data set computed using the trained model parameters and Eq. (3.6).

Fig 4.1. Comparison of different methods w.r.t. the perplexity values (lower
is better) computed from the IW-POD data of two entities (row-1: Sarkozy and
row-2: Hollande) and two time epochs (column-1: epoch t2 and column-2: epoch
t3). Methods: Dynamic Topic Model (DTM) (Blei and Lafferty, 2006), Temporal
Multinomial Mixture (TMM) (Kim et al., 2015) and our proposed Parametric
Link among Multinomial Mixtures (PLMM) method.
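A minimal sketch of the perplexity computation for a Multinomial mixture is
given below. It assumes that the multinomial coefficient is dropped from the
log-likelihood (so each count is scored like a word in a document) and that
the trained parameters are strictly positive; whether the authors kept the
coefficient is not stated in the text, and the function names are
illustrative.

```python
import math


def mm_log_likelihood(X, pi, mu):
    """Log-likelihood of count vectors under a Multinomial mixture,
    without the multinomial coefficient (an assumption of this sketch)."""
    total = 0.0
    for x in X:
        logs = [math.log(p) + sum(c * math.log(m)
                                  for c, m in zip(x, row) if c > 0)
                for p, row in zip(pi, mu)]
        top = max(logs)  # log-sum-exp over the mixture components
        total += top + math.log(sum(math.exp(v - top) for v in logs))
    return total


def perplexity(X_test, pi, mu):
    """Eq. (4.1): exp(-log-likelihood / total number of feature counts)."""
    n_counts = sum(sum(x) for x in X_test)
    return math.exp(-mm_log_likelihood(X_test, pi, mu) / n_counts)
```

As a sanity check, a single uniform component over $D$ features gives a
perplexity of exactly $D$, the value obtained when every feature is equally
likely.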

In our experiments, for each time epoch $t$, we compute perplexity from 5
folds of training-test data division and then take the average of the 5
perplexity values as the final measure. For each fold, we used 80% of the
data to train the model and obtain the parameters $\Theta^{train}$, and the
remaining 20% of the data to compute perplexity using Eq. (4.1). Fig. 4.1
illustrates the perplexity values computed from the data of two entities
(row-1: Sarkozy and row-2: Hollande) and two time epochs (column-1: epoch t2
and column-2: epoch t3). Time epoch t1 is not considered because it does
not reflect the link relationship and temporal aspect of data clustering.

Results in Fig. 4.1 show that PLMM provides the best perplexity compared to
DTM and TMM. This means that, compared to the other methods, PLMM provides
a better fit of the underlying Multinomial distribution to the test data.
The next best method (in 3 out of 4 cases) is DTM, followed by TMM. Indeed,
the results from TMM are intuitive as the fitted models are highly
influenced by the other cluster components (Multinomial distributions) from
the previous and next time epochs. In contrast, PLMM only considers the
link from one cluster in the previous time epoch and fits the data
accordingly.

Fig. 4.2 provides a visual illustration of the clustering results obtained
from the above three methods. This illustration is obtained by using the
Multidimensional scaling (Kruskal and Wish, 1978) technique, where the
distance matrix among the observations is computed by first converting the
count vectors into probabilities and then using the sKLD (Eq. 3.12) as a
measure of distance. The clustering results are obtained with $K = 3$, time
epoch t2 and the observations associated with the entity NS. From a visual
comparison among the plots in Fig. 4.2, we can say that PLMM provides
better separation than TMM and DTM. Indeed, this observation agrees with
the numerical results obtained with the perplexity values in Fig. 4.1(a)
for $K = 3$.


Next, we apply the extension of the PLMM method (Section 3.7) to this
dataset and observe the perplexity for time epochs t2 and t3. For the entity
NS, we obtain average perplexity values of 26.56 for t2 and 25.06 for t3,
where the average $K_{t2}$ is 3 and the average $K_{t3}$ is 5. For the
entity FH, we obtain average perplexity values of 13.08 for t2 and 5.17 for
t3, where the average $K_{t2}$ is 4 and the average $K_{t3}$ is 5. Compared
to the results in Fig. 4.1, we see that the perplexity values increase
(performance decreases) for entity NS and decrease (performance improves)
for FH. Based on these observations, we can say that the extension of PLMM
provides a good compromise in performance and works well for varying $K$ at
different epochs. We do not compare these results with the TMM and DTM
methods as they work with a fixed $K$ for all time epochs.

Fig 4.2. Illustration of clustering results visualized with Multidimensional
scaling (Kruskal and Wish, 1978). Methods: (a) proposed Parametric Link
among Multinomial Mixtures (PLMM); (b) Temporal Multinomial Mixture (TMM)
(Kim et al., 2015) and (c) Dynamic Topic Model (DTM) (Blei and Lafferty,
2006).

Finally, let us focus on the interpretations of cluster evolutions in the
IW-POD dataset. Table 5 provides the selection rate of different models at
different time epochs (see Table 1 for details of the time division). From
the listed rates we can say that:

• The opinions about NS were evolving in an almost similar way during and
after the election period. These evolutions can be interpreted through
the belief on aspects using models M3: ($\delta_{k,d}/0$) (93%) and M4:
($1/\gamma_{k,d}$) (7%). This indicates that during t1-t2-t3 opinions about
NS were changing slowly.

• Model M2: ($0/\gamma_{k,d}$) is selected for all clusters of opinions
about FH during t1-t2. This means that the opinions changed significantly
between t1 and t2. From t2 to t3 (both in the after-election period),
opinions were evolving, which can be interpreted through the belief on the
features with the models M4: ($1/\gamma_{k,d}$) (62%) and M3:
($\delta_{k,d}/0$) (38%).

Table 5

Selection rate of different models (Sec. 3.4) for the IW-POD dataset at
different time epochs (see Table 1 for details of the time division).

              M1: (1/0)   M4: ($1/\gamma_{k,d}$)   M3: ($\delta_{k,d}/0$)   M2: ($0/\gamma_{k,d}$)
NS (t1-t2)    0 %         0 %                      100 %                    0 %
NS (t2-t3)    0 %         13 %                     87 %                     0 %
FH (t1-t2)    0 %         0 %                      0 %                      100 %
FH (t2-t3)    0 %         62 %                     38 %                     0 %


5. Analysis of the political opinion dataset. In this section, we
analyze the clustering results obtained with the PLMM method only.
In order to visualize the contents, we construct a histogram representation,
which helps us to discriminate among different clusters. These histograms
are constructed by counting the polarities (in the vertical direction)
w.r.t. each attribute (in the horizontal direction). The color of each bar
encodes its polarity. Fig. 5.1 illustrates an example of a histogram which
is constructed from the tweets of a cluster from time t2. Following this
illustration, in Fig. 5.2 and 5.3, let us look at the examples of the
clusters at different time epochs for the entities NS and FH respectively.
These results are obtained by clustering the data using the PLMM method
with $K = 3$. From both figures we observe that, at each time epoch, the
clusters have different histogram representations. Moreover, during
different time epochs each cluster undergoes a certain amount of change in
different attributes and associated polarities. This demonstrates that the
proposed PLMM method is able to provide sufficient inter-cluster variations
(at each time) while respecting the temporal dynamics (for each cluster
during different time epochs).

Fig 5.1. Illustration of the clustering results using a histogram
constructed from the polarities of different aspects. The aspects are
ordered from left to right as: (1) Attribute; (2) Balance sheet; (3)
Communication; (4) Entity; (5) Ethic; (6) Injunction; (7) None; (8) Person;
(9) Political line; (10) Project and (11) Skills. The polarities are colored
and ordered from bottom to top as: -2 (dark blue), -1 (blue), 0 (light
orange), 1 (orange), 2 (red) and NULL (grey).

An alternative and compact representation (w.r.t. the MM model parameters)
of the clusters for NS is illustrated in Fig. 5.4(a) and 5.4(b). Similar to
the examples of Fig. 5.2, this alternative representation demonstrates
that, at a certain time epoch, different clusters emphasize different
aspects/polarities of an entity. Besides, the temporal changes of the
clusters can be identified during different epochs by observing the
increase/decrease of the probabilities. However, from the user's
perspective, this representation may not be convenient to understand.
Therefore, we use histograms for further analysis and use this compact
representation for a different purpose.

Now, let us explain the semantics obtained from these clustering results.
For brevity, here we denote a cluster as cl. From Fig. 5.2 (clusters for NS)
we see that, while cl. 1 and 3 emphasize the negative (-) and positive (+)
polarities respectively, cl. 2 emphasizes a particular attribute. Naively
we can say that there are three groups of people: (a) the first group (cl.
1) provides negative opinions on various aspects, and thus tends to hold a
negative image of the entity; (b) the second group (cl. 2) particularly
emphasizes the Ethic of the entity and mostly provides negative opinions and
(c) the third group (cl. 3) can be seen as the opposite of the first group
(cl. 1) as it tends to hold a positive image of the entity. Table 6 provides
three examples of the tweets for time t1 and for each cluster about NS. We
can see that these tweets reflect opinions which correspond to the groups
obtained by the clustering method.

From a temporal viewpoint, we observe several changes w.r.t. different
aspects. In order to analyze the changes using histograms, we observe the
height of the histogram bar for each aspect. This height indicates the
number of tweets/opinions corresponding to the related aspect. Let us
consider the example of the aspect Communication, which plays a certain
role in clustering. We observe that: (a) for cl. 1, the total number of
tweets related to the aspect Communication remains the same from time t1 to
t2 and reduces from t2 to t3; (b) for cl. 2, the total number of tweets
related to this aspect reduces continuously and (c) for cl. 3, the total
number of tweets related to this aspect reduces from t1 to t2 and remains
the same from t2 to t3. Moreover, a closer look at cl. 3 from t2 to t3
reveals an increase of positive opinions about the communication skill of
the entity. Another example is the aspect called Attribute, whose height
reduces continuously with time for both cl. 1 and 3. Similarly, from an
analysis of the height of the histogram bars in Fig. 5.3 (clusters for FH)
we see that the aspects called Entity, Ethic, Political line, Skills and
Communication play a certain role in describing the image of FH. For
example, the tweet “Hollande would remove the word ‘race’ in the
Constitution” (orig: Hollande supprimerait le mot “race” dans la
Constitution) from time t1 and cl. 3 is annotated with the aspect called
Political line and polarity +1. Another tweet, “Hollande and Netanyahu evoke
the struggle against anti-Semitism” (orig: Hollande et Netanyahou evoquent
la lutte contre l'antisemitisme), has the same annotation and is from the
same cluster but from time t3. These two examples reveal the importance of
the aspect Political line for keeping similar opinions in the same group
at different times. The above observations clearly indicate that, for
different groups of people, different aspects have a certain importance at
different times. Therefore, an analyst can retrieve the most prominent
aspects from people's opinions about an entity at a particular time or
within a certain range of time periods.

[Fig 5.2 panel titles (cluster g3): Sarkozy before election (size: 750);
Sarkozy after election (size: 1034); Sarkozy after election (size: 833)]

Fig 5.2. Illustration of the clustering results from the PLMM method for NS.
Results obtained using $K = 3$ for three time epochs t1, t2 and t3. Each
cluster is represented as a histogram constructed from the polarities of
different aspects. The aspects are ordered from left to right as: (1)
Attribute; (2) Balance sheet; (3) Communication; (4) Entity; (5) Ethic; (6)
Injunction; (7) None; (8) Person; (9) Political line; (10) Project and (11)
Skills. The polarities are colored and ordered from bottom to top as: -2
(dark blue), -1 (blue), 0 (light orange), 1 (orange), 2 (red) and NULL
(grey). Each column represents clusters from a particular epoch. Each row
represents a particular cluster in different epochs.

Fig 5.3. Illustration of the clustering results from the PLMM method for FH.
Results obtained using $K = 3$ for three time epochs t1, t2 and t3. Each
cluster is represented as a histogram constructed from the polarities of
different aspects. The aspects are ordered from left to right as: (1)
Attribute; (2) Balance sheet; (3) Communication; (4) Entity; (5) Ethic; (6)
Injunction; (7) None; (8) Person; (9) Political line; (10) Project and (11)
Skills. The polarities are colored and ordered from bottom to top as: -2
(dark blue), -1 (blue), 0 (light orange), 1 (orange), 2 (red) and NULL
(grey). Each column represents clusters from a particular epoch. Each row
represents a particular cluster in different epochs.

Besides the above interpretation of the clustering results, an analyst can 
obtain more information from the PLMM clustering results via the link 
parameters (δ_{k,d} or ^k,d). After analyzing the links among the MM parameters, 
we notice that they provide a compact explanation of the 
temporal changes between two time epochs. Fig. 5.4 illustrates an example 
for entity NS from time t1 to t2 with 3 clusters; see columns 1 and 2 of 
Fig. 5.2 for the corresponding histograms. Fig. 5.4(a) and Fig. 5.4(b) illustrate 
the MM parameters (probabilities of the aspect-polarity features) and Fig. 5.4(c) 
provides a compact representation of the cluster evolutions using the 
values of δ_{k,d}. To make the representation in Fig. 5.4(c) easier to read, we 
transform the link values into 0 (no change), -1 (δ_{k,d} < 0.9, belief increases) 
and +1 (δ_{k,d} > 1.1, belief decreases). In the context of the examples from 
the IW-POD, an increase in belief means that the probability of a feature at 
time t + 1 is higher than its probability at time t. Therefore, the belief indicates the 
relative significance of a particular feature w.r.t. time. An increase in the 
belief means that users tend to be more attracted by the feature. Accordingly, if a 
feature probability is nearly the same at two different times, the belief remains 
unchanged. In Fig. 5.4, we highlight the effect of a particular aspect, 
Communication (Com), and observe its contribution to cluster evolution. 
From Fig. 5.4(a) and (b) we see that, from time t1 to t2, the probabilities of 
Com decrease mostly for cl. 2 and cl. 3. This means that either the users from these 
clusters lose interest in discussing Com and focus on other aspects, or 
those users disappear at time t2. Similar to Com, we can observe other 
aspects such as Eth (cl. 1 and cl. 3) and Ent (cl. 2 and cl. 3), which also drive 
cluster evolution in this example of Fig. 5.4. 
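
The three-level coding of the link values described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `discretize_links`, the list-of-lists layout (one row per cluster, one column per aspect-polarity feature) and the example values are assumptions made here, and the upper threshold of 1.1 is assumed by symmetry with the lower threshold of 0.9.

```python
def discretize_links(delta, low=0.9, high=1.1):
    """Code each link value as -1, 0 or +1, following the convention in the text.

    delta: list of rows (one per cluster), each a list of link values
           (one per aspect-polarity feature). Layout is an assumption.
    low, high: thresholds; 0.9 is from the text, 1.1 is assumed.
    """
    coded = []
    for row in delta:
        coded_row = []
        for value in row:
            if value < low:
                coded_row.append(-1)   # text: delta < 0.9, belief increases
            elif value > high:
                coded_row.append(1)    # text: delta > upper threshold, belief decreases
            else:
                coded_row.append(0)    # nearly unchanged
        coded.append(coded_row)
    return coded


# Hypothetical link values for 3 clusters and 4 features (not from the paper).
delta = [
    [0.50, 1.00, 1.30, 0.95],
    [1.00, 1.00, 2.00, 0.20],
    [0.89, 1.11, 1.10, 0.90],
]
codes = discretize_links(delta)
```

Each row of `codes` can then be drawn as one row of a three-color map such as Fig. 5.4(c); values that fall exactly on a threshold are treated as unchanged in this sketch.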

Let us analyze examples from real Twitter data and observe them w.r.t. 
Fig. 5.4. If we look at cl. 3 at time t1 (before the election), the most likely 
features are often positive, and it is clear that this cluster gathers people in favor of NS. 
The prominent aspects are Att (positive and neutral), Ent (positive) and 
Inj (positive), as in the tweet “40 people @youngpop44 will be present 
at the great gathering in Place #Concorde for supporting @NicolasSarkozy 
! #StrongFrance #NS2012”. This cluster slightly changes later at time t2 
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Table 6 

Real Twitter data examples of the 3 clusters at time t1 for entity NS. See Fig. 5.2, column 
1, for the associated histograms. 

Cluster 1 (Generally Negative) 

Ex. 1 
Orig: Il veut des referendums car... y a pas de pilote dans l’avion, 
dit-il: quel aveu! #Sarkozy #projet 
Trans: He wants referendums because... there is no pilot in the plane, he says: 
what a confession! #Sarkozy #project 

Ex. 2 
Orig: Je ne voterais pas #Sarkozy !” “Je ne voterais pas #Sarkozy ! 
Trans: I won’t vote for #Sarkozy !” “I won’t vote for #Sarkozy ! 

Ex. 3 
Orig: Nicolas Sarkozy, le plus mauvais president de la Veme Republique 
Trans: Nicolas Sarkozy, the worst president of the Fifth Republic 

Cluster 2 (Negative, especially “Ethic”) 

Ex. 1 
Orig: Jamais un president n’a ete cerne par tant d’affaires! demain ds 
@lematinch #Bettencourt #Sarkozy 
Trans: Never before was a president surrounded by so many cases! tomorrow in 
@lematinch #Bettencourt #Sarkozy 
(the “Bettencourt case” is a famous case in which Sarkozy was involved) 

Ex. 2 
Orig: Une liste de condamnes de l’#UMP qui pourrait etre bientot completee par les noms de 
#Sarkozy, #Cope, #Woerth 
Trans: A list of convicted people of #UMP soon completed by names such as 
#Sarkozy, #Cope, #Woerth 

Ex. 3 
Orig: Sarkozy-Kadhafi: la preuve du financement. Et l’urgence d’une 
enquete officielle #affairedetat 
Trans: Sarkozy-Kadhafi: the proof of funding. And the urgency of an 
official enquiry #stateaffair 
(Kadhafi is another case in which Sarkozy was involved in some way) 

Cluster 3 (Generally Positive) 

Ex. 1 
Orig: N Sarkosy mots cle..challenge, defi, action, travail, reussite, formation, effort, 
individualisation ..France Forte. Europe Forte #NS2012 
Trans: N Sarkozy keywords..challenge, defi, action, work, success, training, effort, 
individualization ..Strong France. Strong Europe #NS2012 

Ex. 2 
Orig: merci N.Sarkozy pour tout tu restera pour toujour mon Hero merci. merci 
Trans: Thank you N.Sarkozy for all you will stay my hero forever thanks, thanks 

Ex. 3 
Orig: Sarko est plus rationnel.. 
Trans: Sarko is more rational.. 
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(b) 



(c) 


Fig 5.4. Example of evolution interpretation using the link parameter δ_{k,d} for NS from t1 
to t2 with 3 clusters: (a) MM parameters at time t1; (b) MM parameters at time 
t2; (c) link parameters δ_{k,d} between time t1 and t2. In (c), for each cluster (row-wise), 
a brighter/white color indicates that the prior belief about a feature (aspect-polarity) increases, 
a darker/black color indicates that the prior belief about a feature decreases, and a grey color 
indicates that the prior belief about a feature remains the same. 

(just after the election) towards Att (positive), Ent (positive) and Bal (positive). 
The shift from Inj to Bal is clearly visible in Fig. 5.4(c), third row: the black 
color for Inj means a decrease of attention, whereas the white color for Bal means 
that there are relatively more comments on the balance sheet of NS. For instance, the 
following message shows some nostalgia felt by many militants: “Whatever 
the opinion of FH, NS has been a great president. FH can deconstruct all the 
reforms, we will never forget!”. To sum up, the δ parameter helps us to focus 
on the main changes, even though similar observations could have been 
drawn from the other aspects. Following the same reasoning, all polarities 
targeting the aspect Com are black, which suggests that the performance of 
the politician in the media (e.g., TV, newspapers) is less important once 
the election is over. 

Observations from numerous experiments reveal that, besides performing 
evolutionary clustering on temporal data, PLMM also provides a reasonable 
interpretation of the evolutions, thanks to the link parameters. This 
distinguishes PLMM from the other state-of-the-art methods. 
Moreover, we notice that the interpretability of PLMM (using Eq. 3.9, 3.10 
and 3.11) can be separated out and applied externally to the results 
of any other discrete data clustering method. 

6. Conclusion and Future Perspectives. Over the years, a large 
number of temporal data analysis methods have been proposed in several 
domains. In this paper, we focused only on clustering methods 
that have been used for discrete data and that are based on 
the assumption of the Multinomial distribution. 

We proposed an unsupervised method (i.e., no training from labeled data) 
for analyzing temporal data. The core element of our proposal is the 
formulation of parametric links among the Multinomial distributions. 
Computing these links naturally clusters the evolutionary/temporal data. 
Furthermore, these links can provide an interpretation of cluster evolution 
and can also detect cluster evolution in certain cases. For experimental 
validation, we extensively used synthetic datasets and evaluated the results using the 
Adjusted Rand Index. As a practical application, we applied the method to a dataset 
of political opinions and evaluated it using the Perplexity measure. Results show 
that the proposed method, called PLMM, is better than the state-of-the-art. 
Moreover, it provides an additional advantage through the link parameters, 
which help to interpret the changes in clusters at different times. We also 
provide an extension of the proposed method for dealing with a varying number 
of clusters, which is not addressed by most of the recent methods. 
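
The two evaluation measures named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the evaluation code used in the experiments. The Adjusted Rand Index follows its standard pair-counting definition. The perplexity function assumes the common per-observed-feature definition for a Multinomial mixture and omits the multinomial coefficient; the exact definition used in the paper may differ. All function names, argument layouts and example values are hypothetical.

```python
import math
from collections import Counter


def adjusted_rand_index(labels_true, labels_pred):
    """Standard pair-counting ARI; assumes at least two samples."""
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))
    sum_ij = sum(math.comb(c, 2) for c in pair_counts.values())
    sum_a = sum(math.comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(math.comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / math.comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:          # degenerate case, e.g. one cluster only
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def mixture_perplexity(counts, weights, probs):
    """exp(-log-likelihood / number of observed features) for a mixture.

    counts:  one count vector per sample (features as columns)
    weights: mixing proportions, one per cluster
    probs:   one probability vector per cluster; assumed strictly positive
             wherever a count is non-zero
    """
    total_log_lik = 0.0
    total_count = 0
    for x in counts:
        comp = [
            math.log(w) + sum(c * math.log(p) for c, p in zip(x, alpha) if c > 0)
            for w, alpha in zip(weights, probs)
        ]
        top = max(comp)                # log-sum-exp for numerical stability
        total_log_lik += top + math.log(sum(math.exp(v - top) for v in comp))
        total_count += sum(x)
    return math.exp(-total_log_lik / total_count)


# Hypothetical usage: a clustering that matches the truth up to label names,
# and a single uniform cluster over 4 features.
ari = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ppl = mixture_perplexity([[1, 1, 0, 0], [0, 0, 2, 1]], [1.0], [[0.25] * 4])
```

A higher ARI indicates better agreement with reference labels, whereas a lower perplexity indicates a better fit to the observed data.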

Monitoring/tracking cluster evolution is an interesting issue which we do 
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not explicitly and extensively manage in our proposed method, because it 
is not a primary objective of this paper. Yet, we can partially achieve this 
task by using certain information (the parametric sub-models, see Section 3.4) that is 
naturally integrated into our proposed method. That is, our proposed 
method can be used only as a detector of cluster evolution. At present, we 
consider the complete monitoring task as future work. We believe that 
extensions of several existing works can be combined with our method to 
deal with this issue completely. For example, we can exploit MEC (Oliveira and 
Gama, 2010), which is a cluster evolution monitoring method for continuous 
data. Besides, we can use the label-based diachronic approach (Lamirel, 2012) 
by providing our clustering results as an external input to it. 

Computational complexity is a concern for the proposed method and can 
be considered a limitation. From a decomposition of the computational 
time, we observe that most of the time is consumed by the optimization 
procedure (the Nelder-Mead simplex method). In the future, a better optimization 
method can be incorporated to address this issue. Moreover, the time can 
be further reduced by eliminating the parametric sub-models that are 
experimentally found to be redundant. 

Although we demonstrated the effectiveness of the proposed method only 
on a political opinion dataset, we believe that it will be equally effective on 
other datasets that consist of categorical data. 
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